USER-FRIENDLY TAIL BOUNDS
FOR  MATRIX  MARTINGALES 

JOEL  A.  TROPP 


Abstract.  This  report  presents  probability  inequalities  for  sums  of  adapted  sequences  of  random, 
self-adjoint  matrices.  The  results  frame  simple,  easily  verifiable  hypotheses  on  the  summands,  and 
they  yield  strong  conclusions  about  the  large-deviation  behavior  of  the  maximum  eigenvalue  of  the 
sum.  The  methods  also  specialize  to  sums  of  independent  random  matrices. 


1.  Main  Results 

This technical report is a companion to two other works, the papers “User-friendly tail bounds for
sums of random matrices” [Tro10c] and “Freedman’s inequality for matrix martingales” [Tro10a].
Since  this  report  is  intended  as  a  supplement,  we  have  removed  most  of  the  background  discussion, 
citations  to  related  work,  and  auxiliary  commentary  that  places  the  research  in  a  wider  context. 
We  recommend  that  the  reader  peruse  the  original  papers  before  studying  this  report. 

The paper [Tro10a] describes a martingale technique that leads to an extension of Freedman’s
inequality in the matrix setting, which is similar to the result [Oli10a, Thm. 1.2]. The purpose of
this work is to show how the arguments from [Tro10a] allow us to establish the matrix probability
inequalities for sums of independent random matrices that appear in [Tro10c]. The discussion here
also  contains  some  new  probability  inequalities  for  sums  of  adapted  sequences  of  random  matrices; 
we  have  removed  these  results  from  the  other  two  papers  because  they  are  somewhat  specialized. 


1.1.  Roadmap.  The  rest  of  the  report  is  organized  as  follows.  The  balance  of  §1  provides  an 
overview  of  the  main  results  for  sums  of  independent  random  matrices.  Section  2  contains  the 
main  technical  ingredients  for  the  proof.  Sections  3-5  complete  the  proofs  of  the  matrix  probability 
inequalities  for  adapted  sequences.  Appendix  A  provides  an  overview  of  the  background  material 
that  we  require. 


1.2. Rademacher and Gaussian Series. Let $\|\cdot\|$ denote the usual norm for operators on a
Hilbert space, which returns the largest singular value of its argument, and let $\lambda_{\max}$ denote the
algebraically largest eigenvalue of a self-adjoint matrix. The extreme eigenvalues of a Rademacher
series with self-adjoint matrix coefficients exhibit normal concentration.


Theorem 1.1 (Matrix Rademacher and Gaussian Series). Consider a finite sequence $\{A_k\}$ of fixed
self-adjoint matrices with dimension $d$, and let $\{\varepsilon_k\}$ be a finite sequence of independent Rademacher
variables. Compute the norm of the sum of squared coefficient matrices:

$$\sigma^2 := \Big\| \sum\nolimits_k A_k^2 \Big\|. \tag{1.1}$$

For all $t \geq 0$,

$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k \varepsilon_k A_k \Big) \geq t \Big\} \leq d \cdot e^{-t^2/2\sigma^2}. \tag{1.2}$$

In particular,

$$\mathbb{P}\Big\{ \Big\| \sum\nolimits_k \varepsilon_k A_k \Big\| \geq t \Big\} \leq 2d \cdot e^{-t^2/2\sigma^2}. \tag{1.3}$$

The same bounds hold when we replace $\{\varepsilon_k\}$ by a finite sequence of independent standard normal
random variables.


See [Tro10c, §4] for a detailed discussion of Theorem 1.1, which indicates that it is essentially
sharp. We present the proof in §5.
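
As an illustration of how the bound (1.2) can be checked on a toy instance, the following sketch enumerates every sign pattern of a short Rademacher series with $2 \times 2$ real symmetric coefficients and compares the exact tail probability with $d \cdot e^{-t^2/2\sigma^2}$. The coefficient list `COEFFS` and all function names are hypothetical choices made for this example; the closed-form eigenvalue formula is specific to $2 \times 2$ matrices.

```python
import itertools
import math


def lam_max_2x2(a, b, c):
    """Largest eigenvalue of the real symmetric matrix [[a, b], [b, c]]."""
    return 0.5 * (a + c) + math.hypot(0.5 * (a - c), b)


# Hypothetical coefficients A_k = [[a, b], [b, c]], stored as triples (a, b, c).
COEFFS = [(1.0, 0.5, -1.0), (0.0, 1.0, 0.0), (0.5, -0.5, 0.5),
          (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (-0.5, 1.0, 0.5)]


def sigma_sq(coeffs):
    """sigma^2 = || sum_k A_k^2 ||, as in (1.1)."""
    # For A = [[a, b], [b, c]], A^2 = [[a^2 + b^2, b(a + c)], [b(a + c), b^2 + c^2]].
    sa = sum(a * a + b * b for a, b, c in coeffs)
    sb = sum(b * (a + c) for a, b, c in coeffs)
    sc = sum(b * b + c * c for a, b, c in coeffs)
    # The sum of squares is psd, so its norm equals its largest eigenvalue.
    return lam_max_2x2(sa, sb, sc)


def exact_tail(coeffs, t):
    """Exact P{ lam_max(sum_k eps_k A_k) >= t } over all 2^n sign patterns."""
    hits = 0
    for signs in itertools.product((-1, 1), repeat=len(coeffs)):
        a = sum(s * m[0] for s, m in zip(signs, coeffs))
        b = sum(s * m[1] for s, m in zip(signs, coeffs))
        c = sum(s * m[2] for s, m in zip(signs, coeffs))
        if lam_max_2x2(a, b, c) >= t:
            hits += 1
    return hits / 2 ** len(coeffs)


def rademacher_bound(d, t, s2):
    """Right-hand side of (1.2): d * exp(-t^2 / (2 sigma^2))."""
    return d * math.exp(-t * t / (2.0 * s2))
```

Calling `exact_tail(COEFFS, t)` and `rademacher_bound(2, t, sigma_sq(COEFFS))` for a few levels `t` shows how much slack the dimensional factor leaves in a small example.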


1.3. Sums of Random Semidefinite Matrices. Chernoff bounds describe the upper and lower
tails of a sum of nonnegative random variables. In the matrix case, the analogous results concern a
sum of positive-semidefinite random matrices. The matrix Chernoff bound shows that the extreme
eigenvalues of this sum exhibit the same binomial-type behavior as in the scalar setting.


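Theorem 1.2 (Matrix Chernoff). Consider a finite sequence $\{X_k\}$ of independent, random,
positive-semidefinite matrices with dimension $d$. Suppose that

$$\lambda_{\max}(X_k) \leq R \quad\text{almost surely.}$$

Compute the eigenvalues of the sum of the expectations:

$$\mu_{\min} := \lambda_{\min}\Big( \sum\nolimits_k \mathbb{E} X_k \Big) \quad\text{and}\quad \mu_{\max} := \lambda_{\max}\Big( \sum\nolimits_k \mathbb{E} X_k \Big).$$

Then

$$\mathbb{P}\Big\{ \lambda_{\min}\Big( \sum\nolimits_k X_k \Big) \leq (1-\delta)\mu_{\min} \Big\} \leq d \cdot \left[ \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \right]^{\mu_{\min}/R} \quad\text{for } \delta \in [0,1), \text{ and}$$

$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \geq (1+\delta)\mu_{\max} \Big\} \leq d \cdot \left[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu_{\max}/R} \quad\text{for } \delta \geq 0.$$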


We establish Theorem 1.2 in §3, where it emerges as a consequence of Theorem 3.1, a Chernoff
inequality for sums of adapted sequences of positive-semidefinite matrices.
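
To make the two Chernoff bounds of Theorem 1.2 concrete, here is a minimal sketch that evaluates their right-hand sides. The function names and the log-domain evaluation are choices made for this example, not part of the report.

```python
import math


def chernoff_upper(d, delta, mu_max, R):
    """d * [e^delta / (1 + delta)^(1 + delta)]^(mu_max / R), for delta >= 0."""
    if delta < 0:
        raise ValueError("the upper bound requires delta >= 0")
    # Work with the logarithm of the bracket to avoid overflow for large delta.
    log_base = delta - (1.0 + delta) * math.log1p(delta)
    return d * math.exp(log_base * mu_max / R)


def chernoff_lower(d, delta, mu_min, R):
    """d * [e^(-delta) / (1 - delta)^(1 - delta)]^(mu_min / R), for 0 <= delta < 1."""
    if not 0 <= delta < 1:
        raise ValueError("the lower bound requires 0 <= delta < 1")
    log_base = -delta - (1.0 - delta) * math.log1p(-delta)
    return d * math.exp(log_base * mu_min / R)
```

Both functions return the trivial value `d` at `delta = 0`. The bounds become informative only once the ratio of the extreme eigenvalue of the mean to `R` is large compared with `log d`.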


1.4. Adding Variance Information. In the scalar case, a well-known inequality of Bernstein
shows that a sum of independent, bounded random variables exhibits normal concentration near
its mean with variance controlled by the variance of the sum. On the other hand, the tail of the
sum decays subexponentially on a scale determined by a uniform upper bound for the summands.
Sums of independent random matrices exhibit the same type of behavior, where the normal
concentration depends on a matrix generalization of the variance and the tails are controlled by
a uniform bound for the largest eigenvalue of each summand.


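Theorem 1.3 (Matrix Bernstein). Consider a finite sequence $\{X_k\}$ of independent, random,
self-adjoint matrices with dimension $d$. Suppose that

$$\mathbb{E} X_k = \mathbf{0} \quad\text{and}\quad \lambda_{\max}(X_k) \leq R \quad\text{almost surely.}$$

Compute the norm of the total variance:

$$\sigma^2 := \Big\| \sum\nolimits_k \mathbb{E}(X_k^2) \Big\|.$$

For all $t \geq 0$,

$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \geq t \Big\} \leq d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right).$$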


The  matrix  Bernstein  inequality,  Theorem  1.3,  follows  from  a  more  detailed  result,  which  provides 
stronger  Poisson-type  decay  for  the  tail.  In  §4,  we  derive  these  results  from  a  martingale  result. 
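
The mixed behavior described in §1.4 can be read off from the matrix Bernstein bound directly. The sketch below evaluates the right-hand side of Theorem 1.3 together with two simplified forms that follow from elementary algebra: for $t \leq \sigma^2/R$ the denominator is at most $4\sigma^2/3$, and for $t \geq \sigma^2/R$ it is at most $4Rt/3$. The function names are hypothetical.

```python
import math


def bernstein_bound(d, t, sigma_sq, R):
    """Right-hand side of Theorem 1.3: d * exp(-(t^2/2) / (sigma^2 + R t / 3))."""
    return d * math.exp(-(t * t / 2.0) / (sigma_sq + R * t / 3.0))


def gaussian_regime(d, t, sigma_sq):
    """Weaker bound valid for t <= sigma^2 / R: d * exp(-3 t^2 / (8 sigma^2))."""
    return d * math.exp(-3.0 * t * t / (8.0 * sigma_sq))


def exponential_regime(d, t, R):
    """Weaker bound valid for t >= sigma^2 / R: d * exp(-3 t / (8 R))."""
    return d * math.exp(-3.0 * t / (8.0 * R))
```

With `sigma_sq = 4` and `R = 1`, for instance, the crossover between the two regimes sits at `t = 4`.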
1.5. Miscellaneous Results. The methods in this paper deliver a number of other results:

•  All  of  the  results  described  in  the  front  matter  follow  from  more  general  bounds  for  large 
deviations  of  matrix  martingales.  See  §3-5  for  the  full  story. 

• All the inequalities we have mentioned, with the exception of the matrix Chernoff bounds, have
variants that hold for rectangular matrices. The extensions follow immediately from the
self-adjoint case by applying an elegant device from operator theory, called the self-adjoint
dilation of a matrix [Pau86]. See [Tro10c, §4.2] for additional details.

2.  Tail  Bounds  via  Martingale  Methods 

This  section  contains  the  main  part  of  the  argument,  which  parallels  Freedman’s  argument  for 
producing  large  deviation  bounds  for  scalar  martingales  [Fre75].  The  material  here  duplicates  the 
note [Tro10a].

2.1.  Matrix  Moments  and  Cumulants.  Consider  a  random  s.a.  matrix  X  that  has  moments 
of  all  orders.  By  analogy  with  the  classical  definitions  for  scalar  random  variables,  we  construct 
the  matrix  moment  generating  function  (mgf)  and  cumulant  generating  function  (cgf). 

$$M_X(\theta) := \mathbb{E}\, e^{\theta X} \quad\text{and}\quad \Xi_X(\theta) := \log \mathbb{E}\, e^{\theta X} \quad\text{for } \theta \in \mathbb{R}. \tag{2.1}$$

The mgf has a formal power series expansion that displays the raw moments of the random matrix:

$$M_X(\theta) = \mathbf{I} + \sum_{j=1}^{\infty} \frac{\theta^j}{j!} \cdot \mathbb{E}(X^j).$$

In the scalar setting, the cgf can be interpreted as an exponential mean, a weighted average of
a random variable that emphasizes large (positive) deviations. The matrix cgf admits a similar
intuition, and we treat it as a measure of the variability of a random matrix.

2.2. The Large Deviation Supermartingale. In this section, we extend Freedman’s martingale
techniques [Fre75] to the matrix setting. The matrix cgf and Lieb’s result, Theorem A.1, play a
central role in this development.

We begin with a filtration $\{\mathscr{F}_k : k = 0, 1, 2, \dots\}$ of a master probability space, and we write
$\mathbb{E}_k$ for the conditional expectation with respect to $\mathscr{F}_k$. Consider an adapted random process
$\{X_k : k = 1, 2, 3, \dots\}$ and a previsible random process $\{V_k : k = 1, 2, 3, \dots\}$ whose values are
s.a. matrices with dimension $d$. Suppose that the two processes are related through a conditional
cgf bound of the form

$$\log \mathbb{E}_{k-1}\, e^{\theta X_k} \preccurlyeq g(\theta) \cdot V_k \quad\text{almost surely for } \theta > 0. \tag{2.2}$$

The function $g : (0, \infty) \to [0, \infty]$, and, for simplicity, we do not allow this function to depend on
the index $k$. It is convenient to define the partial sums of the original process and the partial sums
of the conditional cgf bounds:

$$Y_0 := \mathbf{0} \quad\text{and}\quad Y_k := \sum_{j=1}^{k} X_j,$$

$$W_0 := \mathbf{0} \quad\text{and}\quad W_k := \sum_{j=1}^{k} V_j.$$

In almost all our examples, $\{V_k\}$ is a sequence of psd matrices, and so $\{W_k\}$ increases with respect
to the semidefinite order. The random matrix $W_k$ can be viewed as a measure of the total variability
of the process up to time $k$.

To continue, we fix the function $g$ and a positive number $\theta$. Define a real-valued function with
two s.a. matrix arguments:

$$G_\theta(Y, W) := \operatorname{tr} \exp\big( \theta Y - g(\theta) \cdot W \big).$$

We use the function $G_\theta$ to construct a real-valued random process.

$$S_k := S_k(\theta) = G_\theta(Y_k, W_k) \quad\text{for } k = 0, 1, 2, \dots. \tag{2.3}$$

This process is an evolving measure of the discrepancy between the partial sum process $\{Y_k\}$ and
the cumulant sum process $\{W_k\}$. The following lemma describes the key properties of this random
sequence. In particular, the average discrepancy decreases with time. The proof relies on Lieb’s
result, Theorem A.1.

Lemma 2.1. For each fixed $\theta > 0$, the random process $\{S_k(\theta) : k = 0, 1, 2, \dots\}$ defined in (2.3) is
a positive supermartingale whose initial value $S_0 = d$.

Proof. It is easily seen that $S_k$ is positive because the exponential of a self-adjoint matrix is pd,
and the trace of a pd matrix is positive. We obtain the initial value from a short calculation:

$$S_0 = \operatorname{tr} \exp\big( \theta Y_0 - g(\theta) \cdot W_0 \big) = \operatorname{tr} \exp(\mathbf{0}_d) = \operatorname{tr} \mathbf{I}_d = d.$$

To prove that the process is a supermartingale, we ascend a short chain of inequalities.

$$\begin{aligned}
\mathbb{E}_{k-1} S_k &= \mathbb{E}_{k-1} \operatorname{tr} \exp\big( \theta Y_{k-1} - g(\theta) \cdot W_k + \log e^{\theta X_k} \big) \\
&\leq \operatorname{tr} \exp\big( \theta Y_{k-1} - g(\theta) \cdot W_k + \log \mathbb{E}_{k-1}\, e^{\theta X_k} \big) \\
&\leq \operatorname{tr} \exp\big( \theta Y_{k-1} - g(\theta) \cdot W_k + g(\theta) \cdot V_k \big) \\
&= \operatorname{tr} \exp\big( \theta Y_{k-1} - g(\theta) \cdot W_{k-1} \big) \\
&= S_{k-1}.
\end{aligned}$$

In the first step, we remove the term $X_k$ from the partial sum $Y_k$ and rewrite it using the
definition (A.7) of the matrix logarithm. Next, we invoke Lieb’s Theorem, conditional on $\mathscr{F}_{k-1}$, to verify
the concavity of the function

$$A \longmapsto \operatorname{tr} \exp\big( \theta Y_{k-1} - g(\theta) \cdot W_k + \log(A) \big).$$

We apply Jensen’s inequality (A.9) to draw the conditional expectation inside the function. This act
is legal because $Y_{k-1}$ and $W_k$ are both measurable with respect to $\mathscr{F}_{k-1}$. The second inequality
depends on the assumption (2.2) together with the fact (A.6) that the trace of the matrix exponential
is monotone. The final step recalls that $\{W_k\}$ is the sequence of partial sums of $\{V_k\}$. □
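
The one-step supermartingale inequality can be checked numerically in a small case. The sketch below assumes a Rademacher increment $X_k = \varepsilon A$ with a fixed $2 \times 2$ real symmetric matrix $A$, takes $V_k = A^2$ and $g(\theta) = \theta^2/2$ (the cgf bound established for Rademacher series in §5), and compares the exact conditional average of $S_k$ with $S_{k-1}$. Matrices are stored as triples `(a, b, c)` standing for `[[a, b], [b, c]]`; every name here is a hypothetical choice for the example.

```python
import math


def eigs_2x2(m):
    """Both eigenvalues of the real symmetric matrix [[a, b], [b, c]]."""
    a, b, c = m
    mid, rad = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
    return mid - rad, mid + rad


def tr_exp(m):
    """Trace of the matrix exponential, computed through the eigenvalues."""
    lo, hi = eigs_2x2(m)
    return math.exp(lo) + math.exp(hi)


def lin(alpha, m, beta, n):
    """Entrywise linear combination alpha * m + beta * n."""
    return tuple(alpha * x + beta * y for x, y in zip(m, n))


def square(m):
    """Matrix square of [[a, b], [b, c]] in the same triple format."""
    a, b, c = m
    return (a * a + b * b, b * (a + c), b * b + c * c)


def S(theta, Y, W):
    """G_theta(Y, W) = tr exp(theta Y - g(theta) W) with g(theta) = theta^2 / 2."""
    g = 0.5 * theta * theta
    return tr_exp(lin(theta, Y, -g, W))


def expected_next(theta, Y, W, A):
    """Exact average of S_k over eps = +1 and eps = -1, given Y_{k-1} = Y, W_{k-1} = W."""
    W_next = lin(1.0, W, 1.0, square(A))
    up = S(theta, lin(1.0, Y, 1.0, A), W_next)
    down = S(theta, lin(1.0, Y, -1.0, A), W_next)
    return 0.5 * (up + down)
```

Starting from `Y = W = (0.0, 0.0, 0.0)`, the value `S(theta, Y, W)` equals the dimension 2, and `expected_next` never exceeds it, in line with Lemma 2.1.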

Finally,  we  present  a  simple  inequality  for  the  function  Gg  that  holds  when  we  have  control  on 
the  eigenvalues  of  its  arguments. 

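Lemma 2.2. Suppose that $\lambda_{\max}(Y) \geq y$ and that $\lambda_{\max}(W) \leq w$. For each $\theta > 0$,

$$G_\theta(Y, W) \geq e^{\theta y - g(\theta) w}.$$

Proof. Recall that $g(\theta) \geq 0$. The bound results from a straightforward calculation:

$$G_\theta(Y, W) = \operatorname{tr} e^{\theta Y - g(\theta) W} \geq \operatorname{tr} e^{\theta Y - g(\theta) w \mathbf{I}} \geq \lambda_{\max}\big( e^{\theta Y - g(\theta) w \mathbf{I}} \big) = e^{\theta \lambda_{\max}(Y) - g(\theta) w} \geq e^{\theta y - g(\theta) w}.$$

The first inequality depends on the fact that $W \preccurlyeq w\mathbf{I}$ and the monotonicity (A.6) of the trace
exponential. The second inequality relies on the property (A.1) that the trace of a psd matrix is
at least as large as its maximum eigenvalue. The third identity follows from the spectral mapping
theorem and elementary properties of the maximum eigenvalue map. □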
2.3. The Main Result. Our key theorem provides a bound on the probability that the partial
sum  of  a  matrix-valued  random  process  is  large. 

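Theorem 2.3. Consider an adapted sequence $\{X_k\}$ and a previsible sequence $\{V_k\}$ of self-adjoint
matrices with dimension $d$. Assume these sequences satisfy the relations

$$\log \mathbb{E}_{k-1}\, e^{\theta X_k} \preccurlyeq g(\theta) \cdot V_k \quad\text{almost surely for each } \theta > 0,$$

where $g : (0, \infty) \to [0, \infty]$. Define the partial sums

$$Y_k := \sum_{j=1}^{k} X_j \quad\text{and}\quad W_k := \sum_{j=1}^{k} V_j.$$

For all $t, w \in \mathbb{R}$,

$$\mathbb{P}\big\{ \exists k : \lambda_{\max}(Y_k) \geq t \text{ and } \lambda_{\max}(W_k) \leq w \big\} \leq d \cdot \inf_{\theta > 0} e^{-\theta t + g(\theta) w}.$$

In particular, the cumulant bound holds when

$$\mathbb{E}_{k-1}\, e^{\theta X_k} \preccurlyeq e^{g(\theta) \cdot V_k} \quad\text{almost surely for each } \theta > 0.$$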

Proof. First, note that the cgf hypothesis holds when

$$\mathbb{E}_{k-1}\, e^{\theta X_k} \preccurlyeq e^{g(\theta) \cdot V_k}$$

because of the operator monotonicity (A.8) of the logarithm.

The strategy for the main argument is identical with the stopping-time technique used by
Freedman [Fre75]. Fix a positive parameter $\theta$, which we will optimize later. Following the discussion in
Section 2.2, we introduce the random process $S_k = G_\theta(Y_k, W_k)$. Lemma 2.1 implies that $\{S_k\}$ is a
positive supermartingale with initial value $d$. Let us emphasize that these simple properties of the
auxiliary random process distill all the essential information from the hypotheses of the theorem.

Define a stopping time $\kappa$ by finding the first time instant $k$ when the maximum eigenvalue of
the partial sum process $\{Y_k\}$ reaches the level $t$ even though the sum of cumulant bounds has
maximum eigenvalue no larger than $w$.

$$\kappa := \inf\{ k \geq 0 : \lambda_{\max}(Y_k) \geq t \text{ and } \lambda_{\max}(W_k) \leq w \}.$$

When the infimum is empty, the stopping time $\kappa = \infty$. Consider a system of exceptional events:

$$E_k := \{ \lambda_{\max}(Y_k) \geq t \text{ and } \lambda_{\max}(W_k) \leq w \} \quad\text{for } k = 0, 1, 2, \dots.$$

Construct the event $E := \bigcup_{k=0}^{\infty} E_k$ that one or more of these exceptional situations takes place.
The intuition behind this definition is that the partial sum $Y_k$ is typically not large unless the
process $\{W_k\}$ has varied substantially, a situation that the bound on $W_k$ disallows. As a result,
the event $E$ is rather unlikely.

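We are prepared to estimate the probability of the exceptional event. First, note that $\kappa < \infty$ on
the event $E$. Therefore, Lemma 2.2 provides a conditional lower bound for the process $\{S_k\}$ at the
stopping time $\kappa$:

$$S_\kappa = G_\theta(Y_\kappa, W_\kappa) \geq e^{\theta t - g(\theta) w} \quad\text{on the event } E.$$

Since $\mathbb{E}\, S_k \leq d$ for each (finite) index $k$,

$$\begin{aligned}
d &\geq \sum_{k=1}^{\infty} \mathbb{E}[S_\kappa \mid \kappa = k] \cdot \mathbb{P}\{\kappa = k\} = \mathbb{E}[S_\kappa \mid \kappa < \infty] \geq \int_{\{\kappa < \infty\}} S_\kappa \, \mathrm{d}\mathbb{P} \\
&\geq \int_{E} S_\kappa \, \mathrm{d}\mathbb{P} \geq \mathbb{P}(E) \cdot \inf_{E} S_\kappa \geq \mathbb{P}(E) \cdot e^{\theta t - g(\theta) w}.
\end{aligned}$$

We require the fact that $S_\kappa$ is positive to justify these inequalities. Rearrange the relation to obtain

$$\mathbb{P}(E) \leq d \cdot e^{-\theta t + g(\theta) w}.$$

Minimize the right-hand side with respect to $\theta$ to complete the main part of the argument. □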
We often prefer to use a corollary of Theorem 2.3 that describes the sum of a finite process. This
focus  allows  us  to  avoid  distracting  details  about  the  convergence  of  infinite  series. 

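Corollary 2.4. Suppose the hypotheses of Theorem 2.3 are in force, and suppose the random
processes are finite in length. Define

$$Y := \sum\nolimits_k X_k \quad\text{and}\quad W := \sum\nolimits_k V_k.$$

For all $t, w \in \mathbb{R}$,

$$\mathbb{P}\{ \lambda_{\max}(Y) \geq t \text{ and } \lambda_{\max}(W) \leq w \} \leq d \cdot \inf_{\theta > 0} e^{-\theta t + g(\theta) w}.$$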
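
When the infimum over $\theta$ in Corollary 2.4 has no convenient closed form, it can be approximated numerically. The following sketch uses a golden-section search on the exponent $-\theta t + g(\theta) w$; it assumes that this exponent is unimodal on the search interval, which holds when $g$ is convex and $w \geq 0$. The function name, the search interval `theta_max`, and the iteration count are hypothetical choices.

```python
import math


def tail_bound(d, t, w, g, theta_max=50.0, iters=200):
    """Approximate d * inf_{0 < theta <= theta_max} exp(-theta * t + g(theta) * w).

    Golden-section search; assumes the exponent is unimodal in theta.
    """
    def exponent(theta):
        return -theta * t + g(theta) * w

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, theta_max
    for _ in range(iters):
        x1 = hi - ratio * (hi - lo)
        x2 = lo + ratio * (hi - lo)
        if exponent(x1) < exponent(x2):
            hi = x2  # the minimizer lies in [lo, x2]
        else:
            lo = x1  # the minimizer lies in [x1, hi]
    return d * math.exp(exponent(0.5 * (lo + hi)))


def g_subgaussian(theta):
    """g(theta) = theta^2 / 2, the cgf bound used for Rademacher series."""
    return 0.5 * theta * theta


def g_bennett(theta):
    """g(theta) = e^theta - theta - 1, the cgf bound used in the Bennett argument."""
    return math.exp(theta) - theta - 1.0
```

Since every fixed $\theta > 0$ already yields a valid bound, an imprecise search only loosens the estimate and never invalidates it.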


3.  Sums  of  Random  Semidefinite  Matrices 


In this section, we establish Chernoff inequalities for the sum of an adapted sequence of random
psd matrices. This result extends the Chernoff bounds for independent random matrices,
Theorem 1.2, that we presented in §1.3.


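Theorem 3.1 (Matrix Chernoff: Adapted Sequences). Consider a finite adapted sequence $\{X_k\}$
of positive-semidefinite matrices with dimension $d$, and suppose that

$$\lambda_{\max}(X_k) \leq R \quad\text{almost surely.}$$

Define the finite series

$$Y := \sum\nolimits_k X_k \quad\text{and}\quad W := \sum\nolimits_k \mathbb{E}_{k-1} X_k.$$

For all $\mu \geq 0$,

$$\mathbb{P}\{ \lambda_{\min}(Y) \leq (1-\delta)\mu \text{ and } \lambda_{\min}(W) \geq \mu \} \leq d \cdot \left[ \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \right]^{\mu/R} \quad\text{for } \delta \in [0,1), \text{ and}$$

$$\mathbb{P}\{ \lambda_{\max}(Y) \geq (1+\delta)\mu \text{ and } \lambda_{\max}(W) \leq \mu \} \leq d \cdot \left[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu/R} \quad\text{for } \delta \geq 0.$$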


The  Chernoff  bound  for  independent  random  matrices,  Theorem  1.2,  follows  as  an  immediate 
corollary. 


Proof of Theorem 1.2 from Theorem 3.1. In this case, we assume that $\{X_k\}$ is an independent
sequence of psd matrices. Then the matrix $W$ is not random, so we can define the numbers

$$\mu_{\min} := \lambda_{\min}(W) \quad\text{and}\quad \mu_{\max} := \lambda_{\max}(W).$$

As a consequence, we can replace $\mu$ with $\mu_{\min}$ or $\mu_{\max}$, as appropriate, and remove the part of the
event involving $W$ from both probabilities in Theorem 3.1. □


3.1.  Proofs.  We  begin  with  a  semidefinite  bound  for  the  mgf  of  a  random  psd  matrix.  This 
argument  transfers  a  linear  upper  bound  for  the  scalar  exponential  to  the  matrix  case. 

Lemma 3.2 (Chernoff mgf). Suppose that $X$ is a random psd matrix that satisfies $\lambda_{\max}(X) \leq 1$.
Then

$$\mathbb{E}\, e^{\theta X} \preccurlyeq \exp\big( (e^\theta - 1)(\mathbb{E} X) \big) \quad\text{for } \theta \in \mathbb{R}.$$

Proof. Consider the function $f(x) = e^{\theta x}$. Since $f$ is convex, its graph lies below the chord connecting
two points. In particular,

$$f(x) \leq f(0) + [f(1) - f(0)] \cdot x \quad\text{for } x \in [0, 1].$$

More explicitly,

$$e^{\theta x} \leq 1 + (e^\theta - 1) \cdot x \quad\text{for } x \in [0, 1].$$

Since the eigenvalues of $X$ lie in the interval $[0, 1]$, the transfer rule (A.3) implies that

$$e^{\theta X} \preccurlyeq \mathbf{I} + (e^\theta - 1) X.$$

Expectation respects the semidefinite order, so

$$\mathbb{E}\, e^{\theta X} \preccurlyeq \mathbf{I} + (e^\theta - 1)(\mathbb{E} X) \preccurlyeq \exp\big( (e^\theta - 1)(\mathbb{E} X) \big),$$

where the second relation is (A.4). □

We  prove  the  upper  Chernoff  bound  first,  since  the  argument  is  slightly  easier. 

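Proof of Theorem 3.1, Upper Bound. By homogeneity, we may assume that $\lambda_{\max}(X_k) \leq 1$; the
general case follows by re-scaling. An application of Lemma 3.2 demonstrates that

$$\mathbb{E}_{k-1}\, e^{\theta X_k} \preccurlyeq \exp\big( g(\theta) \cdot \mathbb{E}_{k-1} X_k \big) \quad\text{where } g(\theta) = e^\theta - 1 \text{ for } \theta > 0.$$

Corollary 2.4 provides that

$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \geq (1+\delta)\mu \text{ and } \lambda_{\max}\Big( \sum\nolimits_k \mathbb{E}_{k-1} X_k \Big) \leq \mu \Big\} \leq d \cdot \inf_{\theta > 0} e^{(-\theta(1+\delta) + g(\theta)) \mu}.$$

The infimum is achieved when $\theta = \log(1 + \delta)$. Substitute and simplify to complete the proof. □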
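
For readers who want the final substitution spelled out, the following short calculation (in the normalized case $R = 1$, with $g(\theta) = e^\theta - 1$) shows how the choice $\theta = \log(1+\delta)$ produces the bracketed expression in Theorem 3.1. It is a routine verification and introduces nothing beyond the quantities already defined in the proof.

```latex
\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}\theta}\Big[ -\theta(1+\delta) + e^{\theta} - 1 \Big]
  &= -(1+\delta) + e^{\theta} = 0
  \quad\Longleftrightarrow\quad \theta = \log(1+\delta), \\
-\theta(1+\delta) + g(\theta) \,\Big|_{\theta = \log(1+\delta)}
  &= -(1+\delta)\log(1+\delta) + \delta, \\
\exp\Big( \big[ \delta - (1+\delta)\log(1+\delta) \big] \cdot \mu \Big)
  &= \left[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu}.
\end{aligned}
```

Replacing $X_k$ by $X_k / R$ rescales $\mu$ to $\mu / R$, which gives the exponent that appears in the statement of the theorem.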

The  lower  Chernoff  bound  follows  from  a  similar  argument. 

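Proof of Theorem 3.1, Lower Bound. As before, we may assume that $\lambda_{\max}(X_k) \leq 1$. This time,
we intend to apply Corollary 2.4 to the sequence $\{-X_k\}$. Lemma 3.2 demonstrates that

$$\mathbb{E}_{k-1}\, e^{\theta(-X_k)} \preccurlyeq \exp\big( g(\theta) \cdot \mathbb{E}_{k-1}(-X_k) \big) \quad\text{where } g(\theta) = 1 - e^{-\theta} \text{ for } \theta > 0.$$

Corollary 2.4 delivers

$$\mathbb{P}\Big\{ \lambda_{\max}\Big( -\sum\nolimits_k X_k \Big) \geq -(1-\delta)\mu \text{ and } \lambda_{\max}\Big( -\sum\nolimits_k \mathbb{E}_{k-1} X_k \Big) \leq -\mu \Big\} \leq d \cdot \inf_{\theta > 0} e^{(\theta(1-\delta) - g(\theta)) \mu}.$$

Since $\lambda_{\max}(-A) = -\lambda_{\min}(A)$ for each s.a. matrix $A$, we can draw the negation out of the eigenvalue
maps and reverse the sense of the inequalities inside the probability. Finally, we observe that the
infimum occurs when $\theta = -\log(1 - \delta)$. □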

4.  Incorporating  Variance  Information 

In  this  section,  we  establish  a  variant  of  the  Freedman  inequality  for  martingales  [Fre75,  Thm. 
(1.6)].  This  inequality  demonstrates  that  a  sum  of  random  matrices  has  normal  concentration 
around  its  mean  and  Poisson-type  decay  in  the  tails. 

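Theorem 4.1 (Matrix Bennett: Adapted Sequences). Consider a finite adapted sequence $\{X_k\}$ of
self-adjoint matrices with dimension $d$ that satisfy the relations

$$\mathbb{E}_{k-1} X_k = \mathbf{0} \quad\text{and}\quad \lambda_{\max}(X_k) \leq R \quad\text{almost surely.}$$

Define the finite series

$$Y := \sum\nolimits_k X_k \quad\text{and}\quad W := \sum\nolimits_k \mathbb{E}_{k-1}(X_k^2).$$

For all $t \geq 0$ and $\sigma^2 > 0$,

$$\mathbb{P}\{ \lambda_{\max}(Y) \geq t \text{ and } \lambda_{\max}(W) \leq \sigma^2 \} \leq d \cdot \exp\left\{ -\frac{\sigma^2}{R^2} \cdot h\left( \frac{Rt}{\sigma^2} \right) \right\}.$$

The function $h(u) := (1 + u) \log(1 + u) - u$ for $u \geq 0$.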

We  obtain  a  Freedman-type  inequality  for  matrix  martingales  when  we  simplify  the  right-hand 
side  of  the  probability  bound  in  Theorem  4.1. 
Corollary 4.2 (Matrix Freedman). Under the hypotheses of Theorem 4.1,

$$\mathbb{P}\{ \lambda_{\max}(Y) \geq t \text{ and } \lambda_{\max}(W) \leq \sigma^2 \} \leq d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right).$$

Proof. This corollary is a direct consequence of Theorem 4.1 and the numerical inequality

$$h(u) = (1 + u) \log(1 + u) - u \geq \frac{u^2/2}{1 + u/3} \quad\text{for } u \geq 0,$$

which can be obtained by comparing derivatives. □
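
The numerical inequality behind Corollary 4.2 is easy to spot-check on a grid, and the same few lines make it possible to compare the Bennett-type bound of Theorem 4.1 with its Freedman-type simplification. The sketch below is only a sanity check under hypothetical function names; it does not replace the derivative comparison mentioned in the proof.

```python
import math


def h(u):
    """h(u) = (1 + u) log(1 + u) - u for u >= 0."""
    return (1.0 + u) * math.log1p(u) - u


def bernstein_rate(u):
    """Lower bound for h(u) used in Corollary 4.2: (u^2 / 2) / (1 + u / 3)."""
    return (u * u / 2.0) / (1.0 + u / 3.0)


def bennett_bound(d, t, sigma_sq, R):
    """Right-hand side of Theorem 4.1."""
    return d * math.exp(-(sigma_sq / R ** 2) * h(R * t / sigma_sq))


def freedman_bound(d, t, sigma_sq, R):
    """Right-hand side of Corollary 4.2."""
    return d * math.exp(-(t * t / 2.0) / (sigma_sq + R * t / 3.0))
```

For moderate deviations the two bounds nearly coincide, while for large `t` the Bennett form is noticeably smaller because `h(u)` grows like `u log u` rather than linearly.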


The Bernstein inequality, Theorem 1.3, for sums of independent random matrices follows directly
from the Freedman inequality, Corollary 4.2.

Proof of Theorem 1.3 from Corollary 4.2. Indeed, when $\{X_k\}$ is an independent family of random
matrices, the matrix $W$ is deterministic. Therefore, if the bound $\sigma^2 \geq \|W\|$ holds, then it holds
almost surely. As a result, we can remove the condition on $W$ from the probability bound in
the corollary. We can derive a matrix Bennett inequality from Theorem 4.1 in precisely the same
manner. □


The  proof  of  Theorem  4.1  appears  below.  Remark  4.4  shows  that  we  can  obtain  the  same  results 
if  we  are  provided  with  a  set  of  bounds  on  the  moments  of  the  summands. 

4.1.  Proofs.  The  first  lemma  shows  how  to  bound  the  mgf  of  a  zero-mean  random  matrix  using 
an  almost-sure  bound  for  its  largest  eigenvalue.  We  learned  this  argument  from  Yao-Liang  Yu. 

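Lemma 4.3 (Bennett mgf). Suppose that $X$ is a random s.a. matrix that satisfies

$$\mathbb{E} X = \mathbf{0} \quad\text{and}\quad \lambda_{\max}(X) \leq 1 \quad\text{almost surely.}$$

Then

$$\mathbb{E}\, e^{\theta X} \preccurlyeq \exp\big( (e^\theta - \theta - 1) \cdot \mathbb{E}(X^2) \big) \quad\text{for } \theta > 0.$$

Proof. Fix the parameter $\theta > 0$, and define a continuous function $f$ on the real line:

$$f(x) = \frac{e^{\theta x} - \theta x - 1}{x^2} \quad\text{for } x \neq 0 \quad\text{and}\quad f(0) = \frac{\theta^2}{2}.$$

An exercise in differential calculus verifies that $f$ is nonnegative and increasing. The matrix $X$ has
a (random) eigenvalue decomposition $X = Q \Lambda Q^*$ where $\Lambda \preccurlyeq \mathbf{I}$ almost surely. We see that

$$f(X) = Q f(\Lambda) Q^* \preccurlyeq Q \cdot f(\mathbf{I}) \cdot Q^* = f(1) \cdot \mathbf{I}.$$

Expanding the matrix exponential and invoking the conjugation rule (A.2), we discover that

$$e^{\theta X} = \mathbf{I} + \theta X + X f(X) X \preccurlyeq \mathbf{I} + \theta X + f(1) \cdot X^2.$$

To complete the proof, we take the expectation of this semidefinite relation.

$$\mathbb{E}\, e^{\theta X} \preccurlyeq \mathbf{I} + f(1) \cdot \mathbb{E}(X^2) \preccurlyeq \exp\big( f(1) \cdot \mathbb{E}(X^2) \big).$$

The final step follows from (A.4). □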
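
The scalar facts used in the proof of Lemma 4.3 can be spot-checked numerically: the function $f$ is nonnegative and increasing, and consequently $e^{\theta x} \leq 1 + \theta x + f(1) x^2$ whenever $x \leq 1$. The sketch below evaluates $f$ on a grid; the function names and the grid are hypothetical choices, and the check is illustrative rather than a proof.

```python
import math


def f(x, theta):
    """f(x) = (e^{theta x} - theta x - 1) / x^2 for x != 0, and f(0) = theta^2 / 2."""
    if x == 0.0:
        return theta * theta / 2.0
    # expm1 keeps the numerator accurate when theta * x is small.
    return (math.expm1(theta * x) - theta * x) / (x * x)


def quadratic_majorant(x, theta):
    """1 + theta x + f(1) x^2, which dominates e^{theta x} for x <= 1."""
    return 1.0 + theta * x + f(1.0, theta) * x * x
```

Note that `f(1.0, theta)` equals `exp(theta) - theta - 1`, the function `g` that appears in the proof of Theorem 4.1.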

We  are  ready  to  establish  the  Bennett  inequality  for  adapted  sequences  of  random  matrices. 

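Proof of Theorem 4.1. We assume that $R = 1$; the general result follows by re-scaling since $Y$ is
1-homogeneous and $W$ is 2-homogeneous. Invoke Lemma 4.3 to see that

$$\mathbb{E}_{k-1}\, e^{\theta X_k} \preccurlyeq \exp\big( g(\theta) \cdot \mathbb{E}_{k-1}(X_k^2) \big) \quad\text{where } g(\theta) = e^\theta - \theta - 1.$$

Corollary 2.4 implies that

$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k X_k \Big) \geq t \text{ and } \lambda_{\max}\Big( \sum\nolimits_k \mathbb{E}_{k-1}(X_k^2) \Big) \leq \sigma^2 \Big\} \leq d \cdot \inf_{\theta > 0} e^{-\theta t + g(\theta) \sigma^2}.$$

The infimum is achieved when $\theta = \log(1 + t/\sigma^2)$. □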

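Remark 4.4. We can also establish the Bennett mgf bound under appropriate hypotheses on the
growth of the moments of $X$. This argument proceeds by estimating each term in the Taylor series
of the matrix exponential.

Suppose that $X$ is a random s.a. matrix with $\mathbb{E} X = \mathbf{0}$, and assume the moment growth bounds

$$\mathbb{E}(X^j) \preccurlyeq R^{j-2} \cdot A^2 \quad\text{for } j = 2, 3, 4, \dots.$$

We demonstrate that

$$\mathbb{E}\, e^{\theta X} \preccurlyeq \exp\left( \frac{e^{\theta R} - \theta R - 1}{R^2} \cdot A^2 \right) \quad\text{for } \theta > 0.$$

Indeed, the growth condition for the moments yields the bound

$$\begin{aligned}
\mathbb{E}\, e^{\theta X} &= \mathbf{I} + \theta \cdot \mathbb{E} X + \sum_{j=2}^{\infty} \frac{\theta^j\, \mathbb{E}(X^j)}{j!}
\preccurlyeq \mathbf{I} + \sum_{j=2}^{\infty} \frac{(\theta R)^j}{j!} \cdot \frac{A^2}{R^2} \\
&= \mathbf{I} + \frac{e^{\theta R} - \theta R - 1}{R^2} \cdot A^2
\preccurlyeq \exp\left( \frac{e^{\theta R} - \theta R - 1}{R^2} \cdot A^2 \right).
\end{aligned}$$

As usual, the last relation follows from (A.4).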


5.  Rademacher  and  Gaussian  Series 


This section establishes normal concentration for Rademacher and Gaussian series with matrix
coefficients. The first step is to verify the bounds for the mgf of a fixed matrix modulated by a
Rademacher variable or a Gaussian variable; see also [Oli10b, Lem. 2].

Lemma 5.1 (Rademacher and Gaussian mgfs). Suppose that $A$ is an s.a. matrix. Let $\varepsilon$ be a
Rademacher random variable, and let $\gamma$ be a standard normal random variable. Then

$$\mathbb{E}\, e^{\varepsilon \theta A} \preccurlyeq e^{\theta^2 A^2/2} \quad\text{and}\quad \mathbb{E}\, e^{\gamma \theta A} = e^{\theta^2 A^2/2} \quad\text{for } \theta \in \mathbb{R}.$$


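Proof. By absorbing $\theta$ into $A$, we may assume $\theta = 1$ in each case. We begin with the Rademacher
mgf. By direct calculation,

$$\mathbb{E}\, e^{\varepsilon A} = \cosh(A) \preccurlyeq e^{A^2/2},$$

where the second relation is (A.5).

Recall that the moments of a standard normal variable are

$$\mathbb{E}(\gamma^{2j+1}) = 0 \quad\text{and}\quad \mathbb{E}(\gamma^{2j}) = \frac{(2j)!}{j!\, 2^j} \quad\text{for } j = 0, 1, 2, \dots.$$

Therefore,

$$\mathbb{E}\, e^{\gamma A} = \mathbf{I} + \sum_{j=1}^{\infty} \frac{\mathbb{E}(\gamma^{2j})\, A^{2j}}{(2j)!} = \mathbf{I} + \sum_{j=1}^{\infty} \frac{(A^2/2)^j}{j!} = e^{A^2/2}.$$

The first identity holds because the odd terms in the series vanish. □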


We  immediately  obtain  the  bound  for  Rademacher  and  Gaussian  series. 

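Proof of Theorem 1.1. Let $\{\xi_k\}$ be a finite sequence of independent Rademacher variables or
independent standard normal variables. Invoke Lemma 5.1 to obtain

$$\mathbb{E}\, e^{\theta \xi_k A_k} \preccurlyeq e^{\theta^2 A_k^2/2}.$$

By assumption, $\lambda_{\max}\big( \sum_k A_k^2 \big) \leq \sigma^2$ almost surely. Therefore, Corollary 2.4, applied with
$V_k = A_k^2$, $g(\theta) = \theta^2/2$, and $w = \sigma^2$, yields

$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum\nolimits_k \xi_k A_k \Big) \geq t \Big\} \leq d \cdot \inf_{\theta > 0} e^{-\theta t + \theta^2 \sigma^2/2}.$$

The infimum is attained at $\theta = t/\sigma^2$. □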
Appendix A. Mathematical Background

This section provides a short introduction to the background material we use in our proofs.
Section A.1 discusses matrix theory, and Section A.2 reviews some relevant ideas from probability.

A.1. Matrix Theory. Most of these results can be located in Bhatia’s books on matrix
analysis [Bha97, Bha07]. The works of Horn and Johnson [HJ85, HJ94] also serve as good general
references. Higham’s book [Hig08] is an excellent source for information about matrix functions.

A.1.1. Conventions. A matrix is a finite, two-dimensional array of complex numbers. In this paper,
all  matrices  are  square  unless  otherwise  noted.  We  add  the  qualification  rectangular  when  we  need 
to  refer  to  a  general  array,  which  may  be  square  or  nonsquare.  Many  parts  of  the  discussion  do  not 
depend  on  the  size  of  a  matrix,  so  we  specify  dimensions  only  when  it  matters.  In  particular,  we 
usually  do  not  state  the  size  of  a  matrix  when  it  is  determined  by  the  context. 

A.1.2. Basic Matrices. We write $\mathbf{0}$ for the zero matrix and $\mathbf{I}$ for the identity matrix. Occasionally,
we add a subscript to specify the dimension, e.g., $\mathbf{I}_d$ is the $d \times d$ identity.

A  matrix  that  satisfies  QQ*  =  I  =  Q*Q  is  called  unitary.  We  reserve  the  symbol  Q  for  a  unitary 
matrix.  The  symbol  *  denotes  the  conjugate  transpose. 

A  square  matrix  that  satisfies  A  =  A*  is  called  self-adjoint  (briefly,  s.a.).  We  adopt  Parlett’s 
convention  that  letters  symmetric  around  the  vertical  axis  (A,  H,  . . . ,  T)  represent  s.a.  matrices 
unless  otherwise  noted. 

A.1.3. The Semidefinite Order. An s.a. matrix $A$ with nonnegative eigenvalues is called positive
semidefinite (briefly, psd). When the eigenvalues are strictly positive, we say the matrix is positive
definite (briefly, pd). An easy consequence of the definition is that

$$\lambda_{\max}(A) \leq \operatorname{tr} A \quad\text{when } A \text{ is psd} \tag{A.1}$$

because the trace is the sum of the eigenvalues.

The set of all psd matrices with fixed dimension forms a closed, convex cone. Therefore, we may
define the semidefinite partial order on s.a. matrices of the same size by the rule

$$A \preccurlyeq H \quad\Longleftrightarrow\quad H - A \text{ is psd.}$$

In particular, we may write $A \succcurlyeq \mathbf{0}$ to indicate that $A$ is psd and $A \succ \mathbf{0}$ to indicate that $A$ is pd.
For a diagonal matrix, $\Lambda \succcurlyeq \mathbf{0}$ means that each entry of $\Lambda$ is nonnegative.

The semidefinite order is preserved by conjugation:

$$A \preccurlyeq H \quad\Longrightarrow\quad B^* A B \preccurlyeq B^* H B \quad\text{for each matrix } B. \tag{A.2}$$

We refer to (A.2) as the conjugation rule.

A.1.4. Matrix Functions. Let us describe the most direct method for lifting functions on the reals
to functions on s.a. matrices. Consider a function $f : \mathbb{R} \to \mathbb{R}$. First, extend $f$ to a map on diagonal
matrices by applying the function to each diagonal entry:

$$(f(\Lambda))_{jj} := f(\Lambda_{jj}) \quad\text{for each index } j.$$

We extend $f$ to all s.a. matrices by way of the eigenvalue decomposition. If $A = Q \Lambda Q^*$, then

$$f(A) = f(Q \Lambda Q^*) := Q f(\Lambda) Q^*.$$

The spectral mapping theorem states that each eigenvalue of $f(A)$ has the form $f(\lambda)$, where $\lambda$ is
an eigenvalue of $A$. This point is obvious from our definition.

Inequalities for real functions extend to semidefinite relationships for matrix functions:

$$f(a) \leq g(a) \text{ for } a \in I \quad\Longrightarrow\quad f(A) \preccurlyeq g(A) \text{ when the eigenvalues of } A \text{ lie in } I. \tag{A.3}$$




Indeed, let us decompose $A = Q \Lambda Q^*$. It is immediate that $f(\Lambda) \preccurlyeq g(\Lambda)$. Conjugate by $Q$, as
justified by (A.2), and invoke the definition of a matrix function. We sometimes refer to (A.3) as
the transfer rule.

When a real function has a convergent power series expansion, we can also define an s.a. matrix
function via the same power series expansion:

$$f(a) = c_0 + \sum_{j=1}^{\infty} c_j a^j \quad\Longrightarrow\quad f(A) := c_0 \mathbf{I} + \sum_{j=1}^{\infty} c_j A^j.$$

In this case, the two definitions of a matrix function coincide.

Beware:  One  must  never  take  for  granted  that  a  standard  property  of  a  real  function  generalizes 
to  the  associated  matrix  function. 
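
A standard illustration of this warning is that squaring, although increasing on the nonnegative reals, does not preserve the semidefinite order. The sketch below exhibits a pair of $2 \times 2$ psd matrices with $A \preccurlyeq H$ for which $H^2 - A^2$ has a negative eigenvalue. Matrices are stored as triples `(a, b, c)` for `[[a, b], [b, c]]`; the specific matrices and helper names are choices made for this example.

```python
import math


def lam_min_2x2(m):
    """Smallest eigenvalue of the real symmetric matrix [[a, b], [b, c]]."""
    a, b, c = m
    return 0.5 * (a + c) - math.hypot(0.5 * (a - c), b)


def square(m):
    """Matrix square of [[a, b], [b, c]] in the same triple format."""
    a, b, c = m
    return (a * a + b * b, b * (a + c), b * b + c * c)


def minus(m, n):
    """Entrywise difference m - n."""
    return tuple(x - y for x, y in zip(m, n))


A = (1.0, 1.0, 1.0)  # [[1, 1], [1, 1]], psd with eigenvalues 0 and 2
H = (2.0, 1.0, 1.0)  # [[2, 1], [1, 1]], so H - A = [[1, 0], [0, 0]] is psd

order_holds = lam_min_2x2(minus(H, A)) >= 0.0             # A <= H in the semidefinite order
gap_of_squares = lam_min_2x2(minus(square(H), square(A)))  # negative: A^2 <= H^2 fails
```

Here $H^2 - A^2 = \begin{bmatrix} 3 & 1 \\ 1 & 0 \end{bmatrix}$ has determinant $-1$, so it cannot be psd.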


A.1.5. The Matrix Exponential. We may define the matrix exponential of an s.a. matrix $A$ via the
power series

$$\exp(A) := e^A = \mathbf{I} + \sum_{j=1}^{\infty} \frac{A^j}{j!}.$$

The  exponential  of  an  s.a.  matrix  is  always  pd  because  of  the  spectral  mapping  theorem. 

On account of the transfer rule (A.3), the matrix exponential satisfies some simple semidefinite
relations that we collect here. Since $1 + a \leq e^a$ for real $a$, we have

$$\mathbf{I} + A \preccurlyeq e^A \quad\text{for each s.a. matrix } A. \tag{A.4}$$

By comparing Taylor series, one verifies that $\cosh(a) \leq e^{a^2/2}$ for real $a$. Therefore,

$$\cosh(A) \preccurlyeq e^{A^2/2} \quad\text{for each s.a. matrix } A. \tag{A.5}$$

We often work with the trace of the matrix exponential

$$\operatorname{tr} \exp : A \longmapsto \operatorname{tr} e^A.$$

The trace exponential is monotone with respect to the semidefinite order:

$$A \preccurlyeq H \quad\Longrightarrow\quad \operatorname{tr} e^A \leq \operatorname{tr} e^H. \tag{A.6}$$

See [Pet94, Sec. 2] for a short proof of this fact.


A.1.6. The Matrix Logarithm. The matrix logarithm is defined as the functional inverse of the
matrix exponential:

$$\log\bigl(e^{A}\bigr) := A \quad\text{for each s.a. matrix } A. \tag{A.7}$$

This formula determines the logarithm on the pd cone, which is adequate for our purposes. The
matrix logarithm is monotone with respect to the semidefinite order:

$$0 \prec A \preccurlyeq H \quad\Longrightarrow\quad \log(A) \preccurlyeq \log(H). \tag{A.8}$$
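
A brief numerical sketch of (A.7) and (A.8) follows. It computes the logarithm through the eigenvalue decomposition, checks that it inverts the exponential on one symmetric matrix, and checks monotonicity on one ordered pair of pd matrices. The helper `matrix_function`, the random seed, and the way the test matrices are generated are assumptions of this illustration.

```python
import numpy as np

def matrix_function(f, A):
    """Apply the real function f to the eigenvalues of a self-adjoint matrix A."""
    eigvals, Q = np.linalg.eigh(A)
    return (Q * f(eigvals)) @ Q.conj().T

rng = np.random.default_rng(1)
d = 4

# (A.7): log(exp(S)) recovers S for a symmetric matrix S.
G = rng.standard_normal((d, d))
S = (G + G.T) / 2
roundtrip_ok = bool(np.allclose(matrix_function(np.log, matrix_function(np.exp, S)), S))

# (A.8): build pd matrices with A below H, then compare their logarithms.
A = G @ G.T + 0.1 * np.eye(d)          # positive definite
B = rng.standard_normal((d, d))
H = A + B @ B.T                        # H - A is psd

log_gap = matrix_function(np.log, H) - matrix_function(np.log, A)
min_eig = np.linalg.eigvalsh((log_gap + log_gap.T) / 2).min()
log_monotone = bool(min_eig >= -1e-9)

print(roundtrip_ok, log_monotone)
```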


A.1.7. A Theorem of Lieb. The central tool in this paper is a deep theorem of Lieb from his
seminal 1973 work on convex trace functions [Lie73, Thm. 6]. Epstein provides an alternative proof
of this bound in [Eps73, Sec. II], and Ruskai offers a simplified account of Epstein's argument
in [Rus02, Rus05]. For another approach that is based on the joint convexity of quantum relative
entropy [Lin74, Lem. 2], see the recent note [Tro10b].

Theorem A.1 (Lieb). Fix a self-adjoint matrix $H$. The function

$$A \longmapsto \operatorname{tr} \exp\bigl(H + \log(A)\bigr)$$

is concave on the positive-definite cone.
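
Concavity in Theorem A.1 means that the function evaluated at a convex combination of two pd matrices is at least the same convex combination of its values. The sketch below tests this on randomly generated pairs. It is a numerical spot check only, and the function names `lieb` and `random_pd`, the dimension, the seed, the number of trials, and the tolerance are all assumptions introduced here.

```python
import numpy as np

def matrix_function(f, A):
    """Apply the real function f to the eigenvalues of a self-adjoint matrix A."""
    eigvals, Q = np.linalg.eigh(A)
    return (Q * f(eigvals)) @ Q.conj().T

def lieb(H, A):
    """Evaluate tr exp(H + log(A)) for self-adjoint H and positive-definite A."""
    return float(np.trace(matrix_function(np.exp, H + matrix_function(np.log, A))))

def random_pd(rng, d):
    """Draw a random positive-definite matrix (illustrative construction)."""
    G = rng.standard_normal((d, d))
    return G @ G.T + 0.1 * np.eye(d)

rng = np.random.default_rng(2)
d = 3
S = rng.standard_normal((d, d))
H = (S + S.T) / 2                      # fixed self-adjoint matrix

violations = 0
for _ in range(200):
    A1, A2 = random_pd(rng, d), random_pd(rng, d)
    t = rng.uniform()
    value_at_mix = lieb(H, t * A1 + (1 - t) * A2)
    chord = t * lieb(H, A1) + (1 - t) * lieb(H, A2)
    # Concavity requires value_at_mix >= chord, up to rounding error.
    if value_at_mix < chord - 1e-9 * max(1.0, abs(chord)):
        violations += 1

print(violations)
```

If the theorem is implemented faithfully, no violations should be recorded beyond floating-point noise.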

A.2. Probability. We continue with some material from probability, focusing on connections with
matrices. Rogers and Williams [RW00] is our main source for information about martingales.

A.2.1. Conventions. We prefer to avoid abstraction and unnecessary technical detail, so we frame
the standing assumption that all random variables are sufficiently regular that we are justified in
computing expectations, interchanging limits, and so forth. Furthermore, we often state that a
random variable satisfies some relation and omit the qualification "almost surely." We reserve the
letters $V$, $W$, $X$, $Y$ for random s.a. matrices.

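A.2.2. Adapted Sequences. Let $(\Omega, \mathscr{F}, \mathbb{P})$ be a master probability space. Consider a filtration $\{\mathscr{F}_k\}$
contained in the master sigma algebra:

$$\mathscr{F}_0 \subset \mathscr{F}_1 \subset \mathscr{F}_2 \subset \cdots \subset \mathscr{F}_\infty \subset \mathscr{F}.$$

Given such a filtration, we define the conditional expectation

$$\mathbb{E}_k[\,\cdot\,] := \mathbb{E}[\,\cdot \mid \mathscr{F}_k].$$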

We say that a sequence $\{X_k\}$ of random matrices is adapted to the filtration when each $X_k$ is
measurable with respect to $\mathscr{F}_k$. Loosely speaking, an adapted sequence is one where the present
depends only upon the past.

We say that a sequence $\{V_k\}$ of random matrices is previsible when each $V_k$ is measurable with
respect to $\mathscr{F}_{k-1}$. In particular, the sequence $\{\mathbb{E}_{k-1} X_k\}$ of conditional expectations of an adapted
sequence $\{X_k\}$ is previsible.

A stopping time is a random variable $\kappa : \Omega \to \{0, 1, 2, \dots, \infty\}$ that satisfies

$$\{\kappa \le k\} \in \mathscr{F}_k \quad\text{for } k = 0, 1, 2, \dots, \infty.$$

In  words,  we  can  determine  if  the  stopping  time  has  arrived  from  past  experience. 

A.2.3. Matrix Martingales. We say that an adapted sequence $\{Y_k : k = 0, 1, 2, \dots\}$ of s.a. matrices
is a matrix martingale when

$$\mathbb{E}_{k-1} Y_k = Y_{k-1} \quad\text{for } k = 1, 2, 3, \dots.$$

We also impose an $L_1$ boundedness criterion:

$$\mathbb{E}\,\|Y_k\| < \infty \quad\text{for } k = 1, 2, 3, \dots.$$

Since all norms on a finite-dimensional space are equivalent, this condition is the same as the
requirement that each coordinate of each matrix $Y_k$ is integrable. It follows that we obtain a scalar
martingale if we track any fixed coordinate of the sequence $\{Y_k\}$.

Given a matrix martingale $\{Y_k\}$, we construct the difference sequence

$$X_k := Y_k - Y_{k-1} \quad\text{for } k = 1, 2, 3, \dots.$$

Observe that the difference sequence is conditionally zero mean: $\mathbb{E}_{k-1} X_k = 0$. Alternatively, we
may begin with an adapted sequence $\{X_k\}$ of conditionally zero-mean random matrices and then
form the partial sum process

$$Y_0 := 0 \quad\text{and}\quad Y_k := \sum_{j=1}^{k} X_j.$$

It is easy to verify that $\{Y_k\}$ is a martingale, provided that the integrability requirement holds.
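
For completeness, the verification can be written out in one line. The partial sum $Y_{k-1}$ is a function of $X_1, \dots, X_{k-1}$, so it is measurable with respect to $\mathscr{F}_{k-1}$ and passes through the conditional expectation unchanged. The display below uses only the definitions of this subsection.

```latex
\[
\mathbb{E}_{k-1} Y_k
  = \mathbb{E}_{k-1} \bigl( Y_{k-1} + X_k \bigr)
  = Y_{k-1} + \mathbb{E}_{k-1} X_k
  = Y_{k-1}
  \qquad \text{for } k = 1, 2, 3, \dots.
\]
```

Adaptedness of the partial sums follows in the same way, because each summand $X_j$ with $j \le k$ is measurable with respect to $\mathscr{F}_j \subset \mathscr{F}_k$.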

A.2.4. Inequalities for Expectation. Jensen's inequality describes how averaging interacts with
convexity. Let $Z$ be a random matrix, and let $f$ be a real-valued function on matrices. Then

$$\mathbb{E}\, f(Z) \le f(\mathbb{E}\, Z) \quad\text{when } f \text{ is concave}. \tag{A.9}$$

Since the expectation of a random matrix can be viewed as a convex combination and the psd cone
is convex, expectation preserves the semidefinite order:

$$X \preccurlyeq Y \ \text{almost surely} \quad\Longrightarrow\quad \mathbb{E}\, X \preccurlyeq \mathbb{E}\, Y.$$

Finally, let us emphasize that each of these bounds holds when we replace the expectation $\mathbb{E}$ by
the conditional expectation $\mathbb{E}_k$.